Add RVV (RISC-V Vector Extension) optimized convolution and pooling kernels for the NCHWc blocked format in MLAS #28411
velonica0 wants to merge 2 commits into
Conversation
Hi @hariharans29
Pull request overview
This PR extends MLAS’ riscv64 RVV support by wiring up optimized float32 convolution and NCHWc pooling kernels, enabling the NCHWc blocked-format fast paths (BlockSize=16) on RVV-capable systems and replacing the previous generic depthwise implementation.
Changes:
- Adds new riscv64 RVV kernel implementations for direct NCHW/NCHWc conv, depthwise/pointwise NCHWc conv, and max/avg pooling in NCHWc format.
- Wires the new kernels into MLAS_PLATFORM initialization for riscv64 when RVV is available, and enables NCHWc fast-path selection for RISCV64+RVV.
- Updates build configuration to compile the new RVV sources and swap out the previous depthwise kernel source for RISCV64 builds.
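For readers unfamiliar with the layout these fast paths target: in NCHWc, channels are split into blocks of BlockSize=16 and the in-block channel becomes the innermost, contiguous dimension, so one vector load fetches one full block. A minimal indexing sketch (illustrative names, not code from this PR):

```cpp
#include <cstddef>

constexpr size_t BlockSize = 16;  // fixed by the NCHWc layout in MLAS

// Hypothetical helper: flat index of logical element (n, c, h, w) in an
// NCHWc tensor, where c splits into a block index and an in-block offset.
size_t NchwcIndex(size_t n, size_t c, size_t h, size_t w,
                  size_t C, size_t H, size_t W) {
    const size_t cb = c / BlockSize;                  // channel block
    const size_t ci = c % BlockSize;                  // channel within block
    const size_t CBlocks = (C + BlockSize - 1) / BlockSize;
    return (((n * CBlocks + cb) * H + h) * W + w) * BlockSize + ci;
}
```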
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/core/mlas/lib/snchwc.cpp | Enables RISCV64+RVV to use platform-selected NCHWc conv/pool kernels and block size. |
| onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp | New RVV implementations for NCHW/NCHWc conv, depthwise/pointwise NCHWc conv, and NCHWc pooling. |
| onnxruntime/core/mlas/lib/riscv64/sconv_depthwise_kernel_rvv.cpp | New RVV 3x3 s1 depthwise CHW kernel implementation for the multiplier-1 path. |
| onnxruntime/core/mlas/lib/platform.cpp | Registers RVV NCHWc conv/pool kernels and sets NCHWc block size for riscv64 when RVV is present. |
| onnxruntime/core/mlas/lib/mlasi.h | Declares new RVV kernel entry points and adds RISCV64+RVV NCHWc members to MLAS_PLATFORM. |
| onnxruntime/core/mlas/lib/convolve.cpp | Enables the depthwise-direct algorithm path on RISCV64 and updates stride restrictions comment/logic. |
| onnxruntime/core/mlas/inc/mlas.h | Exposes MlasConvAlgorithmDepthwise in the enum for RISCV64 builds. |
| cmake/onnxruntime_mlas.cmake | Adds new RVV kernel sources and adjusts which depthwise source is compiled for RISCV64/RVV. |
Both comments point to the same concern: "What if …". In practice, however, this is not an issue, because the minimum VLEN on current RISC-V CPUs is 128 bits, so this case is effectively covered (though I have implemented the suggested changes regardless). The calculation is as follows: with e32 and LMUL=4, a vector register group holds (VLEN / 32) * 4 floats, so even at the minimum VLEN of 128 that is (128 / 32) * 4 = 16 lanes, exactly one BlockSize=16 block.
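A minimal runnable check of that guarantee (a sketch, assuming only the standard RVV intrinsics; not code from this PR):

```cpp
#include <riscv_vector.h>
#include <cassert>
#include <cstddef>

int main() {
    // vsetvl returns min(requested AVL, VLMAX). At e32 with LMUL=4,
    // VLMAX = (VLEN / 32) * 4, so the minimum VLEN of 128 already
    // yields VLMAX = 16 and the request below always succeeds.
    size_t vl = __riscv_vsetvl_e32m4(16);
    assert(vl == 16);  // one full NCHWc block per iteration
    return 0;
}
```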
Thanks for the clarification. It looks good to me. Let me get one last opinion from Copilot.
| #include "kleidiai/mlasi_kleidiai.h" | ||
| #endif | ||
|
|
||
| #if defined(MLAS_USE_RVV) |
| acc = __riscv_vfadd_vv_f32m4(acc, old_output, vl); | ||
| } | ||
|
|
||
    if (KernelFlags & MLAS_CONV_KERNEL_FLAG_BIAS_ADDITION) {

    // Running sum for average pooling; with ExcludePad, valid_count
    // tracks the per-lane divisor (number of non-padding elements).
    vfloat32m4_t sum_vec = __riscv_vfmv_v_f_f32m4(0.0f, vl);
    uint32_t valid_count[BlockSize];

    if (ExcludePad) {
        for (size_t i = 0; i < BlockSize; i++) {
            valid_count[i] = 0;
        }
    }

    float results[BlockSize];
    __riscv_vse32_v_f32m4(results, sum_vec, vl);
    for (size_t i = 0; i < BlockSize; i++) {
        results[i] /= static_cast<float>(valid_count[i]);
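To make the ExcludePad bookkeeping in the excerpt above concrete, here is a scalar sketch of the same idea for a stride-1 window (illustrative names, not the PR's kernel): out-of-bounds taps contribute nothing, and the divisor is either the full kernel size or the count of in-bounds taps.

```cpp
#include <cstddef>

// Average over one pooling window at output position (oh, ow).
float AveragePoolWindow(const float* input, size_t H, size_t W,
                        size_t oh, size_t ow, size_t KH, size_t KW,
                        size_t pad_top, size_t pad_left, bool ExcludePad) {
    float sum = 0.0f;
    size_t valid = 0;
    for (size_t kh = 0; kh < KH; kh++) {
        for (size_t kw = 0; kw < KW; kw++) {
            const ptrdiff_t ih = static_cast<ptrdiff_t>(oh + kh) - static_cast<ptrdiff_t>(pad_top);
            const ptrdiff_t iw = static_cast<ptrdiff_t>(ow + kw) - static_cast<ptrdiff_t>(pad_left);
            if (ih >= 0 && ih < static_cast<ptrdiff_t>(H) &&
                iw >= 0 && iw < static_cast<ptrdiff_t>(W)) {
                sum += input[static_cast<size_t>(ih) * W + static_cast<size_t>(iw)];
                valid++;  // only in-bounds taps count toward the exclude-pad divisor
            }
        }
    }
    const size_t divisor = ExcludePad ? valid : KH * KW;
    return divisor ? sum / static_cast<float>(divisor) : 0.0f;
}
```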
    const size_t pad_top = Parameters->Padding[0];
    const size_t pad_left = Parameters->Padding[1];
    const size_t pad_right = Parameters->Padding[3];

    // Left edge: output columns whose 3x3 window overlaps the left
    // padding are handled by the scalar edge helper.
    if (pad_left && ow < out_cols) {
        const ptrdiff_t iw = static_cast<ptrdiff_t>(ow) - static_cast<ptrdiff_t>(pad_left);
        float acc = DepthwiseComputeEdge(
            row0, row1, row2, iw, W,
            w00, w01, w02, w10, w11, w12, w20, w21, w22
        );
        if (accumulate_output) {
            acc += beta * out_row[ow];
        }
        out_row[ow++] = acc;
    }

    if (pad_right && ow < out_cols) {
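DepthwiseComputeEdge itself is defined in the PR; as a rough sketch of what a 3x3, stride-1 edge tap has to do (assumed shape, with an illustrative array-of-taps signature): input columns outside [0, W) are skipped, which implements zero padding at the borders.

```cpp
#include <cstddef>

// Illustrative only: iw is the (possibly negative) input column of the
// window's leftmost tap; row pointers are assumed valid, i.e. top/bottom
// padding is handled by the caller.
float DepthwiseEdgeSketch(const float* row0, const float* row1, const float* row2,
                          ptrdiff_t iw, size_t W, const float w[3][3]) {
    float acc = 0.0f;
    for (int kw = 0; kw < 3; kw++) {
        const ptrdiff_t col = iw + kw;
        if (col < 0 || col >= static_cast<ptrdiff_t>(W)) {
            continue;  // out-of-bounds column: zero padding contributes nothing
        }
        acc += w[0][kw] * row0[col] + w[1][kw] * row1[col] + w[2][kw] * row2[col];
    }
    return acc;
}
```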
    for (size_t kh = 0; kh < KernelHeight; kh++) {
        for (size_t kw = 0; kw < KernelWidth; kw++) {
            const float* input_base = Input + output_idx * StrideWidthElements +
                                      kh * DilatedInputWidthElements + kw * DilationWidthElements;

            // Row bounds used below to treat out-of-row accesses as zero padding.
            const float* input_row_start = InputBase + kh * DilatedInputWidthElements;
            const float* input_row_end = input_row_start + InputWidthElements;

            size_t kernel_pos = kh * KernelWidth + kw;

            for (size_t ic = 0; ic < BlockSize; ic++) {
                const float* input_element = input_base + ic;

                float input_value = 0.0f;
                if (input_element >= input_row_start && input_element < input_row_end) {
                    input_value = *input_element;
                }

                // Each kernel tap stores a BlockSize x BlockSize filter tile;
                // broadcast the input scalar against BlockSize output channels.
                size_t filter_offset = kernel_pos * BlockSize * BlockSize + ic * BlockSize;
                vfloat32m4_t filt = __riscv_vle32_v_f32m4(&filter[filter_offset], vl);
                acc = __riscv_vfmacc_vf_f32m4(acc, input_value, filt, vl);
            }
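For context on the filter_offset arithmetic in the loop above: the blocked filter appears to store, per kernel tap, a BlockSize x BlockSize (input channel x output channel) tile with output channels contiguous, which is why a single input scalar can be fused-multiply-accumulated against a 16-wide weight vector. A scalar sketch of that offset math (illustrative, not the PR's code):

```cpp
#include <cstddef>

constexpr size_t BlockSize = 16;

// Offset of weight (kh, kw, ic, oc): taps outermost, then input channel,
// then a contiguous run of BlockSize output channels.
size_t FilterOffset(size_t kh, size_t kw, size_t KernelWidth,
                    size_t ic, size_t oc) {
    const size_t kernel_pos = kh * KernelWidth + kw;
    return kernel_pos * BlockSize * BlockSize + ic * BlockSize + oc;
}
```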
Copilot generated a lot more comments in this round. Can you please take a look?
Description
New kernel files:
- onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp (direct NCHW/NCHWc conv, depthwise/pointwise NCHWc conv, and NCHWc max/average pooling)
- onnxruntime/core/mlas/lib/riscv64/sconv_depthwise_kernel_rvv.cpp (3x3 stride-1 depthwise CHW kernel for the multiplier-1 path)
Motivation and Context
Following #28261, this PR optimizes more MLAS kernels using the RISC-V Vector (RVV) extension.
Please Note:
On the K3 (SpacemiT X60), VLEN=256. With LMUL=4 and e32, the hardware can hold (256/32) * 4 = 32 floats per vector register group, but we only request 16, so we are using half of the available vector width.
The reason is that BlockSize=16 is baked into the NCHWc data layout across the whole framework (matching ARM64 NEON); changing it to 32 would require a different NCHWc format and is not a localized change.
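A small probe that illustrates the half-width usage described above (a sketch, not part of the PR):

```cpp
#include <riscv_vector.h>
#include <cstdio>
#include <cstddef>

int main() {
    // Capacity of one e32/LMUL=4 vector register group on this hardware.
    size_t vlmax = __riscv_vsetvlmax_e32m4();  // 32 when VLEN = 256
    // The NCHWc kernels still request one 16-element block per iteration,
    // because BlockSize = 16 is fixed framework-wide by the layout.
    size_t vl = __riscv_vsetvl_e32m4(16);      // 16
    std::printf("VLMAX = %zu, requested vl = %zu\n", vlmax, vl);
    return 0;
}
```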
Benchmark (SpacemiT K3, VLEN=256, 8-core)
All tests pass with zero numerical error.